Variant Discovery ◾ 111
sequencing, variant calling software that created the file, or the reference genome used for
determining variants. The first few lines of the metadata section describe the file format, file date,
source program, and the reference used. The metadata section also declares and describes
the fields provided at both the site level (INFO) and the sample level (FORMAT) in the data
lines of the data section. For example, the following metadata lines describe the ID, number
of values, data type, and description of some fields that can be found in the INFO and
FORMAT columns in the data section:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
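As a minimal sketch of how such declarations can be read programmatically, the following Python snippet parses an INFO or FORMAT metadata line into a dictionary. The helper name parse_meta_line and the regular expressions are illustrative assumptions, not part of any VCF library, and the sketch does not handle escaped quotes inside a description.

```python
import re

# Hypothetical helper (not from any VCF library): parse one
# "##INFO=<...>" or "##FORMAT=<...>" declaration into a dictionary.
META_RE = re.compile(r'^##(INFO|FORMAT)=<(.*)>$')
# A key followed by either a double-quoted value or an unquoted value.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return the declared keys (ID, Number, Type, Description) or None."""
    match = META_RE.match(line.strip())
    if match is None:
        return None  # not an INFO/FORMAT declaration
    kind, body = match.groups()
    fields = {key: value.strip('"') for key, value in PAIR_RE.findall(body)}
    fields["Kind"] = kind  # remember whether this is site- or sample-level
    return fields


lines = [
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">',
]
declared = [parse_meta_line(line) for line in lines]
for entry in declared:
    print(entry["Kind"], entry["ID"], entry["Type"], entry["Description"])
```

Other metadata lines, such as the file format or file date lines, do not match the pattern, and the helper returns None for them.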
The data section begins with a single tab-delimited header line that has eight mandatory
fields, one for each column of a data line (Table 4.1). The column headers of the data
section are as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO
A FORMAT column, followed by one column per uniquely named sample, is declared only if
genotype data are present. These column names must also be separated by tabs. Each line in
the data section represents a position in the genome. The data correspond to the columns
specified in the header, must be separated by tabs, and must end with a newline. Table 4.1
lists the columns and their expected values. In all cases, missing values should be
represented by a dot (“.”).
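The layout described above can be sketched in Python as follows. This is a minimal illustration, not a full VCF parser: the function names parse_header and parse_record, the sample name SAMPLE1, and the field values in the example record are assumptions made for demonstration only.

```python
MANDATORY = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header(line):
    """Split the header line and return (all column names, sample names)."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != MANDATORY:
        raise ValueError("header must start with the eight mandatory columns")
    samples = []
    if len(columns) > 8:
        # Genotype data present: FORMAT comes next, then the sample names.
        if columns[8] != "FORMAT":
            raise ValueError("ninth column must be FORMAT")
        samples = columns[9:]
    return columns, samples


def parse_record(line, columns):
    """Map one tab-delimited data line onto the header columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError("data line does not match the header")
    # A dot marks a missing value; represent it as None.
    return {
        name.lstrip("#"): (None if value == "." else value)
        for name, value in zip(columns, values)
    }


# Illustrative header and record (values made up for this sketch).
header = "\t".join(MANDATORY + ["FORMAT", "SAMPLE1"]) + "\n"
data = "20\t14370\t.\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5\tGT:GQ:DP\t0|0:48:1\n"

columns, samples = parse_header(header)
record = parse_record(data, columns)
print(samples)
print(record["CHROM"], record["POS"], record["ID"], record["SAMPLE1"])
```

Representing the dot as None keeps missing values distinct from empty strings, which are not valid field values in a data line.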
As shown in Figure 4.1, the variants are on chromosome 20 of the reference genome
NCBI36 (hg18). The figure shows five positions whose coordinates are 14370, 18330,
TABLE 4.1 VCF File Columns
Column #   Column    Description
1          #CHROM    A chromosome identifier (e.g., 11, chr11, X, or chrX)
2          POS       A 1-based reference position (sorted numerically in ascending order within each chromosome)
3          ID        Variant IDs separated by semicolons (no whitespace allowed)
4          REF       The reference base(s) (A, C, G, T, or N). For insertions and deletions, the base preceding the event is included
5          ALT       Comma-separated alternate allele(s) (A, C, G, T, or N). A dot indicates that there is no alternate allele
6          QUAL      A Phred-scaled (logarithmic) quality score
7          FILTER    PASS if the site passed all filters; otherwise a semicolon-separated list of the filters that failed, or a dot if no filters were applied
8          INFO      Site-level information as semicolon-separated key=value pairs
9          FORMAT    Sample-level field names separated by colons (e.g., GT:GQ:DP)